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Abstract 

-i— > 

The smoothing spline is one of the most popular curve-fitting methods, partly because 
of empirical evidence supporting its effectiveness and partly because of its elegant mathe- 
matical formulation. However, there are two obstacles that restrict the use of smoothing 
spline in practical statistical work. Firstly, it becomes computationally prohibitive for large 
data sets because the number of basis functions roughly equals the sample size. Secondly, 
its global smoothing parameter can only provide constant amount of smoothing, which of- 
ten results in poor performances when estimating inhomogeneous functions. In this work, 
we introduce a class of adaptive smoothing spline models that is derived by solving cer- 
tain stochastic differential equations with finite element methods. The solution extends 
the smoothing parameter to a continuous data-driven function, which is able to capture 
the change of the smoothness of underlying process. The new model is Markovian, which 
J> makes Bayesian computation fast. A simulation study and real data example are presented 

to demonstrate the effectiveness of our method. 
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1 Introduction 



The smoothing spline is one of the most popular nonparametric regression methods, partly 
because of empirical evidence supporting its effectiveness and partly because of its elegant math- 
ematical formulation. Consider the model 



Vi = f{U) + £% 



l,...,n; UeT, 



where y — (yi,yz, ■ • • , y n ) is the vector of observations, / is some "smooth" function defined on 
some index set T, and S{ ~ iV(0, t _1 ) with precision (inverse of variance) r. The smoothing 
spline of degree 2p — 1 is defined as the solution of the following minimization problem, 



arg mm 

/ 



8=1 Jr 



+ \ I (/«(*)) 'It 



(2) 



where A > is the smoothing parameter and f^ p \t) is the pth derivative of f(t). The parameter A 
controls the trade-off between fidelity to the data in terms of the residual sum of squares against 
smoothness of the fit in terms of the integrated squared derivative. The value of p is often taken 
to be 1 or 2, corresponding to linear and cubic smoothing spline, respectively. From frequentist 
point of view, the solution / can be explicitly derived within a reproducing kernel Hilbert space 
and A is usually estimated via cross-validation or generalized cross-validation method (see e.g., 



Wahba 


1990 


Gu 


2002) 



2002). From Bayesian point of view, the / is the mean of the posterior 



distribution of / yielded by taking a partially improper Gaussian prior taken on the function 



space (Wahba, 1978; Eubank, 1999 Speckman and Sun, 2003). 



There are two obstacles that restrict using smoothing spline estimators in practical statistical 
work. Firstly, they become computationally intractable for large data sets because the number 



of basis functions roughly equals the sample size (Wahba, 1990; Green and Silverman, 1994). 
The second obstacle stems from the smoothing parameter A. A single parameter A implies that 
the underlying mean process f(t) has a constant amount of smoothing, which is not always 
realistic in practice. It often results in the poor performance of smoothing spline, especially 
when estimating inhomogeneous functions. 



To overcome the computation issue, one approach is to employ regression splines (see Hansen 



and Kooperberg, 2002, for a comprehensive review). The basis implied by solving the spline 
smoothing problem for a small representative data set is found and this small basis is used to 



construct a model for the full data set of interest. The model is typically fitted as a linear 
or generalized linear model without imposing a roughness penalty. The covariate points that 
are used to obtain the reduced basis are known as the 'knots' of the regression spline. The 
number of knots controls the flexibility of the model, but unfortunately their locations tend to 
have a marked effect on the fitted model. Some of the problems with knot placement can be 
partially alleviated by using penalized regression splines (P-splines), where the required penalty 
is associated with the regression spline basis. It is interesting to note that there are two versions 



of P-splines, which can be distinguished by their bases and penalties in use. Eilers and Marx 



(1996) introduced the P-splines with B-spline basis and differencing penalty, while Ruppert and 



Carroll (2000) and Ruppert et al. (2003) proposed a competing method with truncated power 



basis and ridge penalty. Both P-splines have recently gained incredible popularity in statistics 



and applied fields due to their easy implementation using linear mixed model formulation (Eilers 



and Marx, 2010). O'Sullivan (1986) also introduced a similar penalized spline approach using B- 



spline basis, but with a more complicated penalty derived from the integrated squared derivative 



of the fitted curve. The O'Sullivan spline was recently revived by Wand and Ormerod (2008), 
who showed that it possess attractive features, e.g., smoothness, numerical stability and natural 



boundary properties. Simpson et al. (2012) characterized the connection between O'Sullivan 
splines, classical smoothing splines and the Markovian models considered in this paper. 

To increase its smoothing flexibility, many authors have proposed to make smoothing splines 



adaptive, e.g., local generalized cross-validation approach in Cummins et al. (2001), adaptive L- 



splines in Abramovich and Steinberg (1996), hybrid adaptive splines in Luo and Wahba (1997), 



and spatially adaptive smoothing splines in Pintore et al. (2006). There is also extensive litera- 
ture on adaptive P-splines, where a functional structure on the smoothing parameters is imposed 
in the ordinary P-spline models. The adaptive smoothing function is often chosen as another 



layer of P-spline with a set of subknots. Typical works include Lang and Brezger (2004), Bal- 



adandayuthapani et al. (2005), Brezger and Lang (2006), Crainiceanu et al. (2007), Krivobokova 



et al. (2008) and Scheipl and Kneib (2009). As their ordinary counterparts, the adaptive P- 



splines need "good" knots and subknots to provide appropriate adaptive smoothing. Several 
other spline-based adaptive smoothing methods are proposed as well, including local polynomial 



models with adaptive window widths (Fan and Gijbels, 1996), adaptive regression splines (Deni- 
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son et all 119981 IZhou and ShenL 120011 IDi Matteo et al.L 120011 Holmes and Mallickl 120011) and 



mixtures of smoothing splines (Wood et al. 2002, 2008) 



In this work, we propose a unified and efficient Bayesian approach to model smoothing splines, 
which can be easily equipped with adaptive smoothing feature. The method is based on con- 
structing Gaussian Markov random field (GMRF) representations for adaptive smoothing splines 
by solving certain stochastic differential equations. We here provide a brief introduction for 
GMRF. A random vector w = {w\, . . . , w n )' is a GMRF if it has density of form 



[w | S] oc |<5Q|y 2 exp ^— ^(tr — /j,)'Q(w — 



(3) 



where 5 > is scale parameter, /j, is mean vector, and Q is so-called precision matrix. The 
notation \A\+ denotes the generalized determinant of matrix A, which is the product of its 
nonzero eigenvalues. The full conditionals 7c(wi \ w_i), i = l,...,n, only depend on a set of 
neighbors Mi to each site i. The computational gain comes from the fact that the zero-pattern 
of matrix Q relates directly to the notion of neighbors: Qij ^ if and only if i G J\fj U j (see 



e.g., Rue and Held, 2005, Sec 2.2). The GMRFs allow for fast direct numerical algorithms, as 



numerical factorization of Q can be done using sparse matrix algorithms at a typical cost of 



0(n); see Rue and Held (2005) for detailed algorithms. Such good computational properties are 
of major importance in Bayesian inferential methods. This is further enhanced by the link to 



nested integrated Laplace approximations (INLA) (Rue et al. 2009), which allows for fast and 
accurate Bayesian inference for latent Gaussian field models. 

The connection between GMRF and smoothing splines have been explored by several au- 



thors. Speckman and Sun (2003) showed that the random walk (RW) models (a subclass of 



Fahrmeir and Wagenpfeil, 


1996 


Fahrmeir and Knorr-Held 


2000; Fahrmeir and 



Lang, [200T| ), can be used as priors to derive the discretized Bayesian smoothing spline estimator. 



Lang et al. (2002) and Yue et al. (2012) made the RW models spatially adaptive by intro- 



ducing local smoothing parameters into the models. However, all the RW models mentioned 



above are only appropriate for the data observed at regular locations. Lindgren and Rue (2008) 
considered a second-order RW (RW2) model as a discretely observed continuous time process, 
which is derived by solving a stochastic differential equation (SDE) with finite element method. 
The resulting RW2 model is resolution consistent and has a GMRF representation of the cubic 
smoothing spline, with equally good performance but more computational efficiency. 



The aim of this paper is to extend Lindgren and Rue s work in regard to spatial adaptation. 
More specifically, we enable their RW2 model to be spatially adaptive by carefully adding a 
smoothing function to the SDE. The smoothing function is able to provide various amounts 
of smoothing as required by the data. The solution of this modified SDE is thus a spatially 
adaptive smoothing spline, whose GMRF representation is explicitly available for any collection of 
locations. Compared to the existing methods, the adaptive smoothing models considered in this 
paper have a number of advantages. In particular, they have both a convenient computational 
form and a well- understood continuous limit. This not only allows for fast computation, but 
also provides the comfort that issues like knot spacing will only have a minimal and well-known 



effect on the model (see Simpson et al. 2012, for a discussion). Furthermore, they provide a 



satisfactory extension of the models in Lindgren and Rue (2008) to adaptive smoothing, which 
means that we can use the intuition built off those models, and correspondingly off RW2 models 
on regularly-spaced knots, to understand these models. 



2 Bayesian smoothing spline using SDE 



Kimeldorf and Wahba (1970) and Wahba (1978) showed that the smoothing spline / in (|2j) is 



equivalent to Bayesian estimation with a partially improper prior generated by the following 
stochastic differential equation (SDE) 

d p f(t)/dt p = dW{t)/dt, (4) 

where the function W(t) is a zero mean Wiener process with variance t, and dW(t)/dt is often 
referred to as "white noise". Letting (t)+ = t for t > and (t)+ = otherwise, the exact solution 
of SDE Q is shown to be 

f(t) = f3 + (3 1 t + --- + + Z(t)/VS, t G T, (5) 

where 5 > 0, Po, Pi, ■■■iPp ~ N(Q,£) as £ — > oo, and Z(t) is a zero mean Gaussian stochastic 
process with E[Z(s)Z(t)] = £(s, t) and 

- 1 (a - uf' 1 (t - ufr 1 



-du. 



/ (p-1)! (p-1)! 

We actually take a partially improper prior on /, which is "diffuse" on the coefficients of the 
polynomials of degree p — 1, and "proper" over the random process Z(t). Then, the / has the 



property f(t) = lim^oo E^{f(t) \ y, r, 5}, which is the expectation over the posterior distribution 
of f(t) with the prior defined in (|5]). Note that the smoothing parameter A now becomes A = 5/r. 
After taking sensible priors on r and 5, the fully Bayesian inference on / can be straightforwardly 



carried by Monte Carlo Markov chain (MCMC) method (Speckman and Sun 2003 Yue et al. 



2012) 



Unfortunately, the prior ^ is computationally intensive for large data sets because the 
covariance matrix of Z{t) is completely dense. We therefore solve SDE ^ using a finite element 



approach as introduced in Lindgren and Rue (2008). The solution will be shown to be a GMRF 



of form in (|3]). Note that we here only consider cubic smoothing spline (p — 2), which is well 
known to provide the best overall performance. Let t% < t2 < ■ ■ ■ < t n be the set of fixed 
points, which are often observed locations, but do not have to be. Define the inner product 
(/> 9) = I f(t)g(t)dt, where the integral is over the region of interest. We seek a stochastic weak 
solution of (|4j) for p = 2 that satisfies 

d 



;</>, d 2 f/dt 2 ) = (0, dw/dt) 



(6) 



for any sensible test function <p(t), where = denotes equality in distribution (Walsh, 1986). It is 
impossible to test ^ against every function 4>(t), so we chose a finite set {0j(t)}" =1 instead. 
We then construct a finite element representation of f(t) as 



w 



(7) 



for some chosen basis functions ipj and random weights Wj. Letting hj = tj + i — tj for j 
1, . . . , n — 1, a common choice of basis is the piecewise linear functions 



0, t < tj-!, 

1 - j-(t - tj), tj <t < t j+1 , 



0. 



tj+i < t. 



An interpretation of the representation ([7]) with this chosen basis functions is that the weights 
determine the values of the field at the locations, and the values in the interior of the intervals 
are determined by linear interpolation. The full distribution of the continuously indexed solution 



is determined by the joint distribution of the weights w 



,...,w n 



Finally, we let the test functions be the same as our basis functions, which is known as 
Galerkin finite element method. Substituting ([7]) into (J6]) for this set of test functions, we end 
up with a system of linear equations 

n 

wj (tpi, d^j/dt 2 ) = (ipi, dW/dt), i = 1, . . . , n. (8) 

i=i 

The finite dimensional solution is obtained by finding the distribution of w that fulfills the weak 
SDE formulation (|8]). It can be shown that the left hand side of ^ can be written as Hw, 
where H is an n x n tridiagonal matrix whose non-zero entries are 

H[ M - 1] = ^ H[M] = - + £) , H[M + 1] = I (9) 

for 2 < i < n — 1, since ipi only overlap for neighboring basis functions. The entries of the first 
and last row in H are zeroes. Given the statistical properties of white noise, the inner product 
on the right-hand side of (pi) is a Gaussian distribution with zero mean and covariance matrix 
B = i/jj)]ij = i, whose nonzero entries are given by 

B[i,i - 1] = — , B[i,i] = , B[i, i + l] = —, 



with modifications at the boundaries. To achieve distribution equality in (|8j), the random vector 
w has the density of form ^ with = and Q = H'B ~ X H. However, such Q is the dense 
matrix due to the dense B' 1 , making the Galerkin model computationally expensive. Lindgren 
and Rue (2008) showed that without changing the solution we may replace B by a diagonal 
matrix B with B[i,i] = (ipi, 1), giving 

B[l,l] = £, BM = + B[nM = k -^. (10) 

As a result, the matrix Q = H'B X H becomes sparse and w is thus a GMRF. It is straight- 
forward to verify that Q has rank n — 2, with the null space spanned by vectors (1, . . . , 1) T 
and (t\, . . . ,t n ) T . It indicates that the resulting field is invariant to addition of a linear trend, 



coinciding with the result obtained by Wahba (1978) for cubic smoothing spline. 

We have now derived a GMRF w as the weights of a basis function expansion ([7]), which 
approximates the continuous function f(t) everywhere. Simpson et al. (2012) showed that the 
convergence of the approximation depends solely on the basis functions. Given any set of enough 
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points ti, using the piecewise linear functions yields the best finite approximation to the continuos 
process regardless of their locations. Also, the method described above works for any set of test 



and basis functions when all of the computations make sense. Actually, |Simpson et al.| showed 
that the O'Sullivan spline can be exactly derived by solving the SDE in Q using cubic B-splines 
as basis functions and their second derivatives as test functions. However, one should be aware 
that the wrong choice of global basis functions will destroy the Markov structure, and not all 
sets of basis functions will provide good approximations to /(£). 



3 Extensions to adaptive smoothing spline 

Besides their intriguing theoretical and computational properties, one of the most exciting aspects 
of the SDE spline models is their flexibility: it is straightforward to extend them to adaptive 
smoothing spline models. The basic idea is that by making the smoothing parameter vary in 
space, we will be able to control the local smoothing properties of the spline. We here present 
two different adaptive SDE formulations, from both of which we are able to derive the GMRF 
models that provide appropriate adaptive smoothing. 



3.1 Adaptive SDE I 

One way to extend SDE Q is as follows: 

X{t)d 2 f(t)/dt 2 = dW{t)/dt, (11) 

where the positive X(t) can be seen as an adaptive smoothing function, compared to the global 
smoothing parameter A in ordinary smoothing splines. A small X(t) allows big second derivative 
of /(£) for roughness, while a large value diminishes the derivative to increase smoothness. The 



solution to (11) is related to the spatially adaptive smoothing spline introduced in Pintore et al. 



(2006), minimizing 



E(w -/(**))'+ / 

i=l J 



[X(t)f"(t)] 2 dt. (12) 



Using a piecewise-constant model for A(t), Pintore et al. derived closed-form solutions for the cor- 



responding reproducing kernels of the Hilbert space. Their method, however, is computationally 
intensive since the matrix of reproducing kernel is completely dense. 
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Following the non-adaptive case, we seek a weak solution of (11) by achieving 

(ip il Xd 2 f /dt 2 ) = (fadW/dt), i = l,...,n. (13) 
Using the basis representation in ^ as well as Galerkin approximation, the left hand side of 



(13) can be proved to be AHw, where A is a diagonal matrix of A = (A(ti), . . . , X(t n )) and 



H is the matrix as in ^ (see Appendix for the proof). Since the right-hand side of (13) is the 



same as in (|8j), the w is also a GMRF with zero mean and the following precision matrix 

Qx = H'AB^AH. 

It is easy to see that Qx is symmetric and banded with non-zero entries of ith row given by 

Qa[M-2] = 7 7 77 1 Z v Qa[M~1] = -72- "7 + 



Qx[i,i] 



hi-ihi-xQii-2 + hi-xY ' \ hi-2 hi 

2A 2 (t,_ 1 ) , 2X 2 (t t ) (I 1 \ 2X 2 (t l+l ) 



h 2 _ x {hi-2 + hi-x) h^ihi \hi-x hi) h 2 (hi + h i+1 )' 
At the discretization boundaries, we use the convention that terms with non-existing components 
are ignored, that is h^x = h = h n = h n+1 = oo. This affects only the upper left and lower right 
corner of Qx as follows: 

„ r i 2A 2 (t 2 ) „ . . 2A 2 (t 2 ) 
Qa[1,1]= ztt-u —u o Qa[2,1]- 



^ 2 (/ii + /i 2 )' L ' J ^ 2 
n 2A 2 (t 3 ) 2A 2 (t 2 ) / 1 1 

9a[2 ' 2J " ^(^2 + ^3) + "wT + ^ 

n[ . 2A 2 (t w , 2 ) 2A 2 (t ra _ 1 ) / 1 1 
Q A [n-l,n-l] = 7^7 77 7 + 7 E 7 + 



hn-2(hn-3 + h n -2) ^n-2^n-l \h n -2 

^ r i 2A 2 (t„_i) 2A 2 (t n _i) 
QaF, n] = to — 77 tt v Qx[n, n-l\- 



+ h n -xY ' h n _2h 2 n _Y 

Note that Qx does not involve A(ti) or X(t n ) because the first and last rows of H are zeroes. 



3.2 Adaptive SDE II 

An alternative SDE that we can use for adaptive smoothing is 

d 2 X(t)f(t)/dt 2 = dW(t)/dt, (14) 

where X(t) can be seen as a instantaneous variance or local scaling, which compress and stretch 
the function. A small X(t) compresses the scale giving quick oscillations, while a high value 

9 



stretch f(t), decreasing the roughness. Adopting notation f(t) = X(t)f(t), formulation (14) 
corresponds to minimizing 



E fa -/(**)) 2 + / /" 
i=i ^ 



The weak solution of ( 14 ) can also be found using Galerkin method to satisfy 



(Vi, d 2 Xf/dt 2 ) = (iPi, dW/dt), i = l,...,n, 



(15) 



whose left-hand side can be written as HAw, where H and A are defined as above (see Appendix 
for the proof). Again, the w is a GMRF with zero mean and precision matrix 



Qx = AH B HA 

whose nonzero entries can be explicitly written out as 

2X(t l _ 2 )X{t l ) 



Qa[M-2] 



QaM 



hi- 2 hi-i(hi-2 + hi-!)' 



2X 2 (U 



+ 



2X 2 (U 



Q\[i,i- 1] 



1 1 

+ 



2\(t l _ 1 )X(U 

h 2 

u i-i 

2X 2 (t l ) 



i-2 



1 

hi 



h1_ x {hi-2 + hi-x) hi-ihi \hi-i K J h 2 (hi + h i+1 )' 
with corrected boundary entries 

2A(*x)A(* 2 ) 



Qa[1,1] 
Qx%2] 



2X 2 (h) 



h\{hx + h 2 



Qx[2,l] 



2X 2 (t 2 ) 2X 2 (t 2 ) 



hl(h 2 + h ?) 



h\h 2 
1 1 



h\h 2 \hi h 2 



Qx[n - l,n - 1] 
Q x [n,n) 



2X 2 {t 



n—l , 



+ 



2A 2 (t 



n-l; 



hn- 2 (h n -3 + h n ^ 2 ) h n - 2 h n ^i \h n _ 2 /i n -i 
2X 2 (t n ) . , 2A(t„_i)A(t n ) 



h\_ x (h n _ 2 + h n _ x ) 



, Q x [n,n-1) 



hn-ih 2 n _ x 



3.3 Modeling adaptive smoothing function 

To implement fully Bayesian inference, we need a prior taken on the smoothing function X(t), 
which is assumed to be continuous and differentiable. Since it is restricted to be positive, we 



model X(t) on its log scale: u(t) = log(A(t)). Yue and Speckman (2010) and Yue et al. (2012) 



have proved that the prior on v(t) must be proper in order to guarantee a proper posterior for 
such adaptive smoothing models. 
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It is intuitive to model v(t) in a similar way to /(£). We therefore follow the basis expansion 
in ([7]) and represent v(t) as a weighted sum of m basis function Uk(t), that is 

m 

fe=i 

with random weights 7 = (71, . . . , 7™)'. Unfortunately, the previous GMRF prior cannot be put 



on 7 since it is intrinsic. Lindgren et al. (2011) derived an explicit link between GMRF and 



common Gaussian fields by considering SDE 

{k 2 - d 2 /dt 2 ) u{t) = dW{t)/dt, (16) 



where k > is fixed. Again, we use Galerkin method to weakly solve (16) as 



^-f k (uj e ,K 2 - d 2 u k /dt 2 ) = (u e ,dW/dt), £ = l,...,m. (17) 



k=l 



With piecewise linear basis, it can be shown that the left hand side of (17) is (kB — H)-y and 
the right hand side is a Gaussian random vector as before. As a result, the precision matrix of 
the corresponding GMRF is given by 

R = (k 2 B - H)'B-\k 2 B -H) = k 4 B - k 2 (H' + H) + H'B l H. 

To make R sparse, we replace B by B as before. This GMRF prior is proper and it is getting 
intrinsic as n goes to zero. Due to the computational advantage of GMRF, it is feasible to use 
full-rank basis expansion (m = n) to make the method fully automatic. 



4 Posterior inference 

The fully Bayesian inference requires the hyperpriors on parameters r, 5 and rj. We choose diffuse 
but proper gamma priors, i.e. Gamma(e, e) for e = 0.001. Then, the joint posterior distribution 
of both adaptive smoothing spline models can be written as 

[y I w,t][w I 5, 7] [7 I 77] WIN- 

To obtain the posterior distribution, we here present two different approaches. They are sim- 
ulation method via Monte Carlo Markov chain (MCMC) and approximation method based on 
integrated nested Laplace approximation (INLA). 

11 



4.1 MCMC approach 

Let ^ = {i) jiti)}™ j=\ and f2 = {^fc(^)}^=i be the matrices of basis functions for f(t) and v(t), 
respectively. Then, the hierarchical models have the following structure: 

y = ^w + e, £ ~ iV(0,T _1 I), 

[w | 5,X] oc |5Q A |+ /2 exp ^-^u/Q A it^ , 

log(Ai) = u i} v = fl-y, 

[Thlocl^l^exp^lV^), 

r ~ Gamma(a T , b T ), 

5 ~ Gamma(a5, 65), 

i] ~ Gamma(a„ 6,,). 

We here focus on how to sample 7 from its full conditional because the rest sampling procedures 
are straightforward. As we can see, the full conditional of 7 is not a regular density, so we have 
to employ Metropolis-Hastings sampling technique. We here present an efficient algorithm to 
sample 7 when using the first adaptive SDE. Unfortunately, we have not found an equivalently 
efficient method for the second adaptive SDE, which, however, can be taken care of by INLA 
method as described in next section. 

A good proposal distribution is the key to the successful Metropolis-Hastings algorithm. It is 
helpful to see that the GMRF derived from the first adaptive SDE can be written as a random 
walk model, i.e., 

[w I 5, 7] oc f[ (5e^) 1/2 exp {-^1 ) , 
i=i ^ ' 

where w = (0, w 2 , ■ ■ ■ , w n -i-> 0)' = Hw (note the first and last rows of H are zeroes). Since 7^ 
depends on Wi only, it is possible to construct an accurate GMRF approximation for the full 
conditional of 7 given by ^(7 | w,5,rj) oc [w | 5, 7] [7 | 77] as follows. First, we approximate 
[wi I 5, 7j] using Taylor expansion at ja, 

K I <*»7»] ~ ex P + Hloihi ~ \ c i(.loi)l 2 i 
where aj is the nuisance parameter, fej(7oi) = 1/2 — 5e 70i (l — •y 0i )wf/2 and q(7 i) = 5e t{ - H wf/2. 
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Letting b = {pi, ... , b n )' and c = (ci, . . . , c n )', then the density 

V{j | 70, 8, 7]) oc exp f-^'^T + Yl( Qi + b ^oiH ~ ^(ToOt?^ J 

oc exp (^-]p'{r]R + diag(c))7 + b'j^j , 

is a GMRF approximation to J-{"y \ w,8,T]). In order to make the approximation accurate, we 
choose 70 to be the mode of V{p( | 70, w, 8, if), which can be obtained using, say Newton- Raphson 
method. Using the GMRF approximation as proposal distribution, we can update the whole 7 
by accepting proposal 7* with probability 

min ( 1 



Hi* 


1 w,8,rj)V{j 






w, 5, rj)V(-y* 


lo,w,8,rj) 



Other full conditionals are given by 

(w I 7,<5,r) ~ N(nw,^ w ), Hw = T?: w yy and Z w = {rW + SQx)' 1 
(r I w) ~ Gamma(n/2 + a T , \\y - *$>w\\ 2 /2 + b T ) 
(5 I A, w) ~ Gamma(n/2 — 1 + a s , w'Qxw/2 + b s ) 
(77 I 7) ~ Gamma(n/2 — 1 + a v , j'Rj/2 + b v ), 

all of which can be easily sampled. 
4.2 INLA approach 



Rue et al. (2009) have developed the R computer package INLA for Bayesian inference using 
integrated nested Laplace approximations. The INLA can handle general Gaussian hierarchi- 
cal models, including the both adaptive smoothing spline models developed in this paper. It 
accurately approximates marginal posterior densities and computes estimates much faster than 
general MCMC techniques. 

The general Gaussian hierarchical models have a set of hyperparameters 6 with prior 7r(#), 
a latent variable / with density 7r(f\0) and an observed response y with likelihood 7i{y\f,6). 
The posterior is then given by 

*(/,%) oc7r(y|/,0)7r(/|0)7r(0). 
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We need to find the posterior marginals n(fa\y) and ir(8j\y), which can be done using INLA. 
The approach is based on the following approximation for the posterior marginal of 6: 

7r(f,0,y) 



n(0\y) oc 



/=/*(») 



where 7Tg(/|0, y) is the Gaussian approximation to the full conditional of /, and f*{0) is the 
mode of the full conditional of /. The approximated marginals are then constructed as follows: 

m\y) = f Ho\y)dO- v 
Hfa\y) = j *{fi\o,v)n{o\y)do, 

where 0_j denotes a subvector of without element 9j. The approximated marginal of 8j 
can be obtained by summing out the remaining variables 6_j from jr{0\y). The approximated 
marginal of fi is obtained by, first, approximating the full conditional of fa with another Laplace 
approximation: 

7r(f,e,y) 



7t(fa\e,y) (x 



KGG(f-i\fi,0,y) 



f-i=fU(f-i> ) 



where ttgg is the Gaussian approximation to f-i\fa, 0, y and 0) is the mode configuration. 

Then, we numerically integrate out the parameters from 7r(fi\0,y). This nested approach 
makes the Laplace approximations very accurate. 

However, INLA has a limitation that is it only works when the number of hyperparameters 
in is small, say less than 15. The reason is that it becomes extremely expensive to numerically 
integrate out as its dimension increases. In our case, the hyperparameters 6 = (7, r, 5, if). As 
a result, we have to use reduced-rank basis to model 7 if we want to fit the models with INLA. 



5 Simulated examples 

In this section we consider three functions: a slowly-varying smooth function, a function with a 
sharp peak, that is spatially inhomogeneously smooth, and a highly-oscillating Doppler function. 
Gaussian noise is added to each in generating the data. The functions together with samples of 
data are shown in Figure [TJ In Example 1, the true function is a spline with three internal knots 
at (0.2, 0.6, 0.7) and coefficients (20, 4, 6, 11, 6). The function is evaluated on a regular grid of 
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(a) Example 1 (b) Example 2 (c) Example 3 




20 40 60 80 1 00 20 40 60 80 1 00 50 1 00 1 50 200 



Figure 1: The three true functions used in the simulation study together with one sample. 

101 points, and a zero-mean Gaussian noise is added to the true function with standard deviation 
0.9. In Example 2, the true function is f(t) = sin(t) + 2exp(— 30t 2 ) for t G [—2,2], evaluated at 
101 regularly spaced points, and the standard deviation of the noise is 0.5. In Example 3, the 
Doppler function is given by f{t) = y/t(l-t) sin(27r(l + e)/(t + e)) for e = 0.125, evaluated at 
201 regularly spaced points, and the standard deviation of the noise is 0.2. 

We compare our Bayesian adaptive smoothing spline (BASS) estimates with ordinary smooth- 
ing spline (OSS) estimates, using mean squared error 

1 n 2 

MSE=-5^[/(t i )-/(*0 • 
n . l 

i=i 

The BASS model derived from the first adaptive SDE is fitted by MCMC while the one from the 
second adaptive SDE is estimated by INLA. Note that with MCMC we use the same number of 
knots as the data points, while with INLA we respectively use 3, 5 and 10 knots for the three 
examples. The median mean squared error, together with first and third quartile, based on 200 
samples of data is reported in Table [T} As we can see, the OSS model slightly outperforms the 
two BASS models when estimating the slowly-varying smooth function, but the BASS models 
significantly work better in the peak and Doppler functions that are more spatially adaptive. 
It is interesting to see that two different BASS models, which are fitted by different methods, 
yield quite similar average MSE's. It indicates that the both SDE formulations offer appropriate 
adaptive smoothing, and INLA makes as accurate inference as MCMC does with much faster 
computation. The only limitation of INLA, as mentioned, is that it only works when there are 
a small number of hyperparameters to estimate. Therefore, INLA could be a better inferential 
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Example 1 


Example 2 


Example 3 


BASS-vl 


0.0620 (0.0444, 0.0854) 


0.0297 (0.0219, 0.0420) 


0.0072 (0.0061, 0.0084) 


BASS-v2 


0.0633 (0.0468, 0.0886) 


0.0274 (0.0206, 0.0377) 


0.0072 (0.0058, 0.0088) 


OSS 


0.0600 (0.0401, 0.0853) 


0.0408 (0.0336, 0.0531) 


0.0092 (0.0082, 0.0102) 



Table 1: Simulation study. Median MSE with first and third quartiles in brackets based on 
200 samples obtained using BASS and OSS procedures. The BASS-vl and BASS-v2 denote the 
models derived from the first and second adaptive SDEs, respectively. 



tool than MCMC for BASS models if only a few knots are needed to capture the structure of 
the adaptive smoothing function. 



6 Real data example 

To illustrate the techniques developed so far, we now consider the data presented in Figure |2j 
These observations consist of accelerometer readings taken through time in an experiment on 



the efficacy of crash helmets. The data set was used by Silverman (1985) and is available in R 
software package. For various reasons, the time points are not regularly spaced, and there are 
multiple observations at some time points. In addition the observations are all subject to error. 
It is of interest both to discern the general shape of the underlying acceleration curve and to 
draw inferences about its minimum and maximum values. But, for illustrative purposes we shall 
concentrate on estimating the general shape only. 

It is clear from Figure [2] that the variance of the data is not constant over time. To take 
into account this heteroskedastic property, we modify model ([T]) by adding random weights to 
the errors, that is £, ~ A^(0, r -1 /^ 1 ) for i = l,...,n. Again, we take diffuse gamma prior on 
r. Regarding pi, we use independent gamma prior with both shape and scale being half, i.e., 
Pi ~ Gamma(0.5, 0.5). If integrate each pi out of £j, we can see that £j follows an independent 
Cauchy distribution, which is able to provide flexible shrinkage due to its heavy tails and sharp 
peak. Such modifications on errors can be easily incorporated into the adaptive smoothing spline 
model by only adding the step of sampling pi to the MCMC algorithm. 

The effect of applying adaptive smoothing technique and Cauchy errors is shown in Figure 
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Figure 2: The motorcycle impact data. 



3j We here present four different fitted curves and their 95% credible intervals: (a) OSS with 
Gaussian errors; (b) BASS with Gaussian errors; (c) OSS with Cauchy errors; (d) BASS with 
Cauchy errors. Note that we fit the BASS model derived from the first adaptive SDE using 
MCMC since the other model yields similar performance in the simulation study. As we can see, 
all the fits give a clear indication of the general pattern of the data, which is constant at first and 
then drops sharply, followed by a rebound above its original level before setting back. Compared 
to the OSS models, the BASS models show attractive adaptive smoothing features: the fits are 
smoother near the left and in the right half of the picture, while they yield lower drops in the 
middle. Compared to the Gaussian errors, the Cauchy errors make the fit follows the data more 
closely and offers a more reasonable credible interval (being narrow on the left end but wide in 
the right half), which captures the variance pattern well. In our opinion, the BASS model with 
Cauchy errors gives the best overall fit. 
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(a) OSS with Gaussian errors 



(b) BASS with Gaussian errors 
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(c) OSS with Cauchy errors 
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(d) BASS with Cauchy errors 
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Figure 3: The fitted curved with their 95% credible intervals, constructed from the motorcycle 
impact data. 
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7 Conclusion 



In this paper we have developed a unified Bayesian approach to model adaptive smoothing splines. 
It is based on the connection between smoothing splines and stochastic differential equations. 



We showed that the SDE approach in Lindgren and Rue (2008 ) can be easily adapted to adaptive 



smoothing problems. Using the finite element method, the GMRF representations of the adaptive 
smoothing splines were explicitly derived. Furthermore, we proposed efficient MCMC and INLA 
algorithms to make Bayesian inference. Finally, we demonstrated the effectiveness of our method 
through a simulation study and an application to the motorcycle data. 

A Appendix 

This section includes detailed proofs for the weak solutions of both adaptive SDEs. 
A.l Adaptive SDE I 

Using basis expansion (m), the adaptive SDE (13) becomes a linear equation system, whose left 
hand side can be written as H\iv. We here show how to derive the non-zero entries of matrix 
H\. Using integration-by-parts, we have [i,j]th entry of H\ as 

H x [i,j] = Ui{t),\{t)M{t) 



\(t)^(t)^(t)dt 
h 

X(t)^(M(t)\l: - [ tn [X(t)^(t)]'Mt)dt 



I i r, 



ti 



ti Jti 
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Since basis ipi only overlap for neighboring locations, the nonzero entries in ith row of H\ are 
H\[i, % — 1], H\[i, i] and H\[i, i + 1] for i = 2, . . . , n — 1. Specifically, we have 

H[i,i-i] = - t x'^MM-^dt- r KMit^um 
Jti-i Jti-i 

-\(t)Am_ 1 (t)\l_ i + t \{t)[UtWi-M'dt- t \(M(M-i(t)dt 



X(ti)/h 



Note that V'i-iW is constant between U-± and U, and thus we have [ipi(t)ip'i-i(t)]' = 
Similarly, we have 



Ha[M] 



1+1 \'{t)^ l {t)^ i {t)dt - / !+1 \{t)4{t) 2 dt 

U-l Jti-l 

r x'itWiitwwdt- r \{t)^{tfdt 



= -A(t 



\{ti)ipi(ti) - A(^_i)^i(^-i) A(* i+ i)^i(* i+ i) - A(*i)^i(*i) //ij 

1 1 



- \ f mm +1 (x)dt- f 1+1 xmm +1 (t)dt 
ju Ju 

= - \(t)UtH + i(t)\l +1 + f 1+1 \(M(M + i(t)dt - f 1+1 KM(M +1 (t)dt 

Jti Jti 

= -\{t i+1 )^i{t i+1 )ip' i+1 {t i+1 ) + A(t i )^ i (*i)^ +1 (t i ) 
= A(*i)/^. 

For first and last row of H\, the (possible) nonzero entries are H\[l, 1], if>[l,2], H\[n,n — 1] 
and ffx[^, which happen to be zeroes due to the intrinsic condition. We here only show the 
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derivation of H\[l, 1] and H\[l, 2], and the other entries can be obtained similarly. We have 
H x [l,l] 



H x [l,2] 



Ju Ju 



o. 



t-2 



t2 



-A(*i)v>i(ti)^(*i) + KtJMtiWM 
o. 



Finally, we can easily see that H\ = AH, where A is the diagonal matrix of A(-)'s and H is the 
tridiagonal matrix defined as in (|Ql) . 



A. 2 Adaptive SDE II 



Letting / = X(t)f(t), the left hand side of (15) can be written as 



A(t)f(t)dt 



ti(t)f(t)dt 



Ait) X\t)f(t) + X(t)f(t) 



dt 



Using basis expansion ([7]), the adaptive SDE (15) becomes a linear equation system, whose left 
hand side can be written as H\w. We then have [i, j]th entry of H\ as 

pt n pt n 

H x [i,j] = A'(ttyi(f)tfc(t)|£ + x(t)UtWMt - / x'iMiWMdt - / x(M(t)^(t)dt. 
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Since basis ipi only overlap for neighboring locations, the nonzero entries in ith row of H\ are 
H\[i, % — 1], H\[i, i] and H\[i, i + 1] for 2 = 2, . . . , n — 1. Specifically, we have 



H[i,i-1] = - T A'(t)^(*)^i-i(*)d* - /' \(M(M-i(t)dt 



ti-1 J ti^i 

= -A(t)#(fM-i(*)it 1 + r a(*)[^(o^-i(o]'*- r a(o^(oci(o* 

j ti-i Jti-i 

j ti-i j ti-i 

= -A(fi)$(ti)^i-i(*i) + A^-O^^-O^-i^-i) 
= A(^_i)//ij_i. 

Note that ■?/>•(£) is constant between and ij, and thus we have [^(t^i-i^)]' = ■4>' i (t)il) l i _ 1 (t) . 
Similarly, we have 

jt A [m] = - r +i x'iMit^dt - r +i A(t)^(t) 2 ^ = -A^) c^- + ^ , 

Jti— 1 J ti—i \ i 1 fc/ 



which is the same as in the previous case, and 

Ki+l pti+1 



+ = -f +1 A'(t)^(^ t+1 (t)dt-/' +1 A(^(^ +1 (J)(lt 

= -A(t)^(t)^ + i(t)iS; +1 + t +1 \(M(M + i(t)dt - t +1 \(M(M +1 (t)dt 

Jt, Jt, 



= -\(t i+ M(t i+1 )il; i+1 (t i+1 ) + A(* i )^(* i )^i+i(*i) 
= \{t i+l )/hi. 

For first and last row of H\, the (possible) nonzero entries are H\[l, 1], H\[l,2], H\[n,n — 1] 
and H\[n,n], of which the first two entries can be derived as 

ft 2 /-t2 



(*)^i(t)dt- / A(t)^i(t) 2 cit 



= -A'(ti)^ a (ti) - A(ti)^i(*i)V^(*i) - A(*i)#(0lM*)l£ 
= -A'(ti)^?(*i) - A(ti)^i(*0^i(*i) + A(*i)^(*i)^i(*i) 
= -A'fa), 

h x [i,2] = \>(t)Mt)Mt)\l; + HWiWMlZ - /'aWiCW')*- [ 2 \(t)^(t)^ 2 (t)dt 

Jti Jti 

= -A'(*i)^i(fi)^ 2 (*l) - A(ti)^i(fi)V>2(*l) - A(t 2 )^i(t 2 )^2(t 2 ) + A(ti)^i(*i)^2(ti) 

= -A(ti)//n + A(t 2 )//ii. 
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Similarly, the last two entries are given by 

H x [n - l,n] = A(* n _i)//i n _i - \{t n )/h n -i and H\[n,n] = X'(t n ). 

These four entries can be viewed as (at least approximately) the derivatives of A(t) at the 
boundary points. To be consistent with the previous case, we assume the Neumann boundary 
condition: A'(ti) = A'(i„) = 0, to make the entries be zeroes. Then, we can easily see that 
H\ = HA, where A is the diagonal matrix of A(-)'s and H is the matrix defined as in Q. 
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